{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "sUtoed20cRJJ"
   },
   "source": [
    "# Load CSV, Numpy, and Text data in TensorFlow\n",
    "\n",
    "\n",
    "\n",
    "## Learning Objectives\n",
    "\n",
    "1. Load a CSV file into a `tf.data.Dataset`. \n",
    "2. Load Numpy data\n",
    "3. Load text data.\n",
    "\n",
    "\n",
    "\n",
    "## Introduction \n",
    "\n",
    "In this lab, you load CSV data from a file into a `tf.data.Dataset`.  This tutorial provides an example of loading data from NumPy arrays into a `tf.data.Dataset` you also load text data.\n",
    "\n",
    "Each learning objective will correspond to a __#TODO__ in this student lab notebook -- try to complete this notebook first and then review [the solution notebook](../solutions/load_diff_filedata.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "fgZ9gjmPfSnK"
   },
   "source": [
    "## Load necessary libraries \n",
    "We will start by importing the necessary libraries for this lab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "baYFZMW_bJHh"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TensorFlow version:  2.6.5\n"
     ]
    }
   ],
   "source": [
    "import functools\n",
    "\n",
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "\n",
    "print(\"TensorFlow version: \",tf.version.VERSION)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Ncf5t6tgL5ZI"
   },
   "outputs": [],
   "source": [
    "TRAIN_DATA_URL = \"https://storage.googleapis.com/tf-datasets/titanic/train.csv\"\n",
    "TEST_DATA_URL = \"https://storage.googleapis.com/tf-datasets/titanic/eval.csv\"\n",
    "\n",
    "train_file_path = tf.keras.utils.get_file(\"train.csv\", TRAIN_DATA_URL)\n",
    "test_file_path = tf.keras.utils.get_file(\"eval.csv\", TEST_DATA_URL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "4ONE94qulk6S"
   },
   "outputs": [],
   "source": [
    "# Make numpy values easier to read.\n",
    "np.set_printoptions(precision=3, suppress=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Wuqj601Qw0Ml"
   },
   "source": [
    "## Load data\n",
    "\n",
    "This section provides an example of how to load CSV data from a file into a `tf.data.Dataset`.  The data used in this tutorial are taken from the Titanic passenger list. The model will predict the likelihood a passenger survived based on characteristics like age, gender, ticket class, and whether the person was traveling alone.\n",
    "\n",
    "To start, let's look at the top of the CSV file to see how it is formatted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "54Dv7mCrf9Yw"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone\n",
      "0,male,22.0,1,0,7.25,Third,unknown,Southampton,n\n",
      "1,female,38.0,1,0,71.2833,First,C,Cherbourg,n\n",
      "1,female,26.0,0,0,7.925,Third,unknown,Southampton,y\n",
      "1,female,35.0,1,0,53.1,First,C,Southampton,n\n",
      "0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y\n",
      "0,male,2.0,3,1,21.075,Third,unknown,Southampton,n\n",
      "1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n\n",
      "1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n\n",
      "1,female,4.0,1,1,16.7,Third,G,Southampton,n\n"
     ]
    }
   ],
   "source": [
    "!head {train_file_path}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "jC9lRhV-q_R3"
   },
   "source": [
    "You can [load this using pandas](pandas_dataframe.ipynb), and pass the NumPy arrays to TensorFlow. If you need to scale up to a large set of files, or need a loader that integrates with [TensorFlow and tf.data](../../guide/data.ipynb) then use the `tf.data.experimental.make_csv_dataset` function:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "67mfwr4v-mN_"
   },
   "source": [
    "The only column you need to identify explicitly is the one with the value that the model is intended to predict. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "iXROZm5f3V4E"
   },
   "outputs": [],
   "source": [
     "# TODO 1: Add string name for label column \n",
     "LABEL_COLUMN = '' \n",
     "LABELS = []"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "t4N-plO4tDXd"
   },
   "source": [
    "Now read the CSV data from the file and create a dataset. \n",
    "\n",
    "(For the full documentation, see `tf.data.experimental.make_csv_dataset`)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "yIbUscB9sqha"
   },
   "outputs": [],
   "source": [
    "def get_dataset(file_path, **kwargs):\n",
    "# TODO 2\n",
    "# TODO: Read the CSV data from the file and create a dataset \n",
    "dataset = tf.data.experimental.make_csv_dataset( \n",
    "# TODO: Your code goes here.\n",
    "# TODO: Your code goes here.\n",
    "# TODO: Your code goes here.\n",
    "# TODO: Your code goes here.\n",
    ") \n",
    "  return dataset\n",
    "\n",
    "raw_train_data = # TODO: Your code goes here.\n",
    "raw_test_data = # TODO: Your code goes here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "v4oMO9MIxgTG"
   },
   "outputs": [],
   "source": [
    "def show_batch(dataset):\n",
    "  for batch, label in dataset.take(1):\n",
    "    for key, value in batch.items():\n",
    "      print(\"{:20s}: {}\".format(key,value.numpy()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "vHUQFKoQI6G7"
   },
   "source": [
    "Each item in the dataset is a batch, represented as a tuple of (*many examples*, *many labels*). The data from the examples is organized in column-based tensors (rather than row-based tensors), each with as many elements as the batch size (5 in this case).\n",
    "\n",
    "It might help to see this yourself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "HjrkJROoxoll"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sex                 : [b'male' b'male' b'male' b'male' b'male']\n",
      "age                 : [34. 18. 45. 46. 29.]\n",
      "n_siblings_spouses  : [1 0 1 1 1]\n",
      "parch               : [0 0 0 0 0]\n",
      "fare                : [26.     8.3   83.475 61.175  7.046]\n",
      "class               : [b'Second' b'Third' b'First' b'First' b'Third']\n",
      "deck                : [b'unknown' b'unknown' b'C' b'E' b'unknown']\n",
      "embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton'\n",
      " b'Southampton']\n",
      "alone               : [b'n' b'y' b'n' b'n' b'n']\n"
     ]
    }
   ],
   "source": [
    "show_batch(raw_train_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "YOYKQKmMj3D6"
   },
   "source": [
    "As you can see, the columns in the CSV are named. The dataset constructor will pick these names up automatically. If the file you are working with does not contain the column names in the first line, pass them in a list of strings to  the `column_names` argument in the `make_csv_dataset` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "2Av8_9L3tUg1"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sex                 : [b'male' b'female' b'male' b'male' b'male']\n",
      "age                 : [30. 50. 18. 51. 28.]\n",
      "n_siblings_spouses  : [1 0 1 0 0]\n",
      "parch               : [0 1 1 0 0]\n",
      "fare                : [ 16.1   247.521   7.854   8.05    7.05 ]\n",
      "class               : [b'Third' b'First' b'Third' b'Third' b'Third']\n",
      "deck                : [b'unknown' b'B' b'unknown' b'unknown' b'unknown']\n",
      "embark_town         : [b'Southampton' b'Cherbourg' b'Southampton' b'Southampton' b'Southampton']\n",
      "alone               : [b'n' b'n' b'n' b'y' b'y']\n"
     ]
    }
   ],
   "source": [
    "CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']\n",
    "\n",
    "temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMNS)\n",
    "\n",
    "show_batch(temp_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "gZfhoX7bR9u4"
   },
   "source": [
    "This example is going to use all the available columns. If you need to omit some columns from the dataset, create a list of just the columns you plan to use, and pass it into the (optional) `select_columns` argument of the constructor.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "S1TzSkUKwsNP"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "age                 : [28. 34. 28. 50.  2.]\n",
      "n_siblings_spouses  : [0 0 0 2 1]\n",
      "class               : [b'Third' b'First' b'Third' b'First' b'Second']\n",
      "deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']\n",
      "alone               : [b'y' b'y' b'y' b'n' b'n']\n"
     ]
    }
   ],
   "source": [
    "SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'class', 'deck', 'alone']\n",
    "\n",
    "temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS)\n",
    "\n",
    "show_batch(temp_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9cryz31lxs3e"
   },
   "source": [
    "## Data preprocessing\n",
    "\n",
    "A CSV file can contain a variety of data types. Typically you want to convert from those mixed types to a fixed length vector before feeding the data into your model.\n",
    "\n",
    "TensorFlow has a built-in system for describing common input conversions: `tf.feature_column`, see [this tutorial](https://www.tensorflow.org/tutorials/structured_data/feature_columns) for details.\n",
    "\n",
    "\n",
    "You can preprocess your data using any tool you like (like [nltk](https://www.nltk.org/) or [sklearn](https://scikit-learn.org/stable/)), and just pass the processed output to TensorFlow. \n",
    "\n",
    "\n",
    "The primary advantage of doing the preprocessing inside your model is that when you export the model it includes the preprocessing. This way you can pass the raw data directly to your model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9AsbaFmCeJtF"
   },
   "source": [
    "### Continuous data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Xl0Q0DcfA_rt"
   },
   "source": [
    "If your data is already in an appropriate numeric format, you can pack the data into a vector before passing it off to the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "4Yfji3J5BMxz"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "age                 : [28.  32.5 28.  32.  28. ]\n",
      "n_siblings_spouses  : [0. 1. 0. 0. 0.]\n",
      "parch               : [0. 0. 0. 0. 0.]\n",
      "fare                : [26.55  30.071  7.829 13.     7.75 ]\n"
     ]
    }
   ],
   "source": [
    "SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']\n",
    "DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]\n",
    "temp_dataset = get_dataset(train_file_path, \n",
    "                           select_columns=SELECT_COLUMNS,\n",
    "                           column_defaults = DEFAULTS)\n",
    "\n",
    "show_batch(temp_dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "zEUhI8kZCfq8"
   },
   "outputs": [],
   "source": [
    "example_batch, labels_batch = next(iter(temp_dataset)) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "IP45_2FbEKzn"
   },
   "source": [
    "Here's a simple function that will pack together all the columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "JQ0hNSL8CC3a"
   },
   "outputs": [],
   "source": [
    "def pack(features, label):\n",
    "  return tf.stack(list(features.values()), axis=-1), label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "75LA9DisEIoE"
   },
   "source": [
    "Apply this to each element of the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "VnP2Z2lwCTRl"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:AutoGraph could not transform <function pack at 0x7f52c0743ea0> and will run it as-is.\n",
      "Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\n",
      "Cause: 'arguments' object has no attribute 'posonlyargs'\n",
      "To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert\n",
      "WARNING: AutoGraph could not transform <function pack at 0x7f52c0743ea0> and will run it as-is.\n",
      "Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\n",
      "Cause: 'arguments' object has no attribute 'posonlyargs'\n",
      "To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert\n",
      "[[ 18.      1.      0.    108.9  ]\n",
      " [ 31.      0.      0.     50.496]\n",
      " [ 70.      1.      1.     71.   ]\n",
      " [ 24.      1.      0.     16.1  ]\n",
      " [ 31.      1.      1.     37.004]]\n",
      "\n",
      "[0 0 0 0 0]\n"
     ]
    }
   ],
   "source": [
    "packed_dataset = temp_dataset.map(pack)\n",
    "\n",
    "for features, labels in packed_dataset.take(1):\n",
    "  print(features.numpy())\n",
    "  print()\n",
    "  print(labels.numpy())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "1VBvmaFrFU6J"
   },
   "source": [
    "If you have mixed datatypes you may want to separate out these simple-numeric fields. The `tf.feature_column` api can handle them, but this incurs some overhead and should be avoided unless really necessary. Switch back to the mixed dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ad-IQ_JPFQge"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sex                 : [b'male' b'female' b'male' b'male' b'male']\n",
      "age                 : [18. 28. 28. 28. 28.]\n",
      "n_siblings_spouses  : [0 0 0 3 1]\n",
      "parch               : [0 0 0 1 1]\n",
      "fare                : [ 7.75   7.879  7.75  25.467 15.246]\n",
      "class               : [b'Third' b'Third' b'Third' b'Third' b'Third']\n",
      "deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']\n",
      "embark_town         : [b'Southampton' b'Queenstown' b'Queenstown' b'Southampton' b'Cherbourg']\n",
      "alone               : [b'y' b'y' b'y' b'n' b'n']\n"
     ]
    }
   ],
   "source": [
    "show_batch(raw_train_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "HSrYNKKcIdav"
   },
   "outputs": [],
   "source": [
    "example_batch, labels_batch = next(iter(temp_dataset)) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "p5VtThKfGPaQ"
   },
   "source": [
    "So define a more general preprocessor that selects a list of numeric features and packs them into a single column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "5DRishYYGS-m"
   },
   "outputs": [],
   "source": [
    "class PackNumericFeatures(object):\n",
    "  def __init__(self, names):\n",
    "    self.names = names\n",
    "\n",
    "  def __call__(self, features, labels):\n",
    "    numeric_features = [features.pop(name) for name in self.names]\n",
    "    numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]\n",
    "    numeric_features = tf.stack(numeric_features, axis=-1)\n",
    "    features['numeric'] = numeric_features\n",
    "\n",
    "    return features, labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "1SeZka9AHfqD"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:AutoGraph could not transform <__main__.PackNumericFeatures object at 0x7f52c06f77b8> and will run it as-is.\n",
      "Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\n",
      "Cause: module 'gast' has no attribute 'Constant'\n",
      "To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert\n",
      "WARNING: AutoGraph could not transform <__main__.PackNumericFeatures object at 0x7f52c06f77b8> and will run it as-is.\n",
      "Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\n",
      "Cause: module 'gast' has no attribute 'Constant'\n",
      "To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert\n",
      "WARNING:tensorflow:AutoGraph could not transform <__main__.PackNumericFeatures object at 0x7f52c06f7438> and will run it as-is.\n",
      "Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\n",
      "Cause: module 'gast' has no attribute 'Constant'\n",
      "To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert\n",
      "WARNING: AutoGraph could not transform <__main__.PackNumericFeatures object at 0x7f52c06f7438> and will run it as-is.\n",
      "Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.\n",
      "Cause: module 'gast' has no attribute 'Constant'\n",
      "To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert\n"
     ]
    }
   ],
   "source": [
    "NUMERIC_FEATURES = ['age','n_siblings_spouses','parch', 'fare']\n",
    "\n",
    "packed_train_data = raw_train_data.map(\n",
    "    PackNumericFeatures(NUMERIC_FEATURES))\n",
    "\n",
    "packed_test_data = raw_test_data.map(\n",
    "    PackNumericFeatures(NUMERIC_FEATURES))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "wFrw0YobIbUB"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sex                 : [b'male' b'male' b'male' b'female' b'male']\n",
      "class               : [b'Third' b'Second' b'Third' b'First' b'First']\n",
      "deck                : [b'unknown' b'unknown' b'unknown' b'B' b'B']\n",
      "embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton' b'Cherbourg']\n",
      "alone               : [b'n' b'y' b'y' b'n' b'y']\n",
      "numeric             : [[ 4.     4.     2.    31.275]\n",
      " [16.     0.     0.    26.   ]\n",
      " [25.     0.     0.     7.05 ]\n",
      " [36.     0.     2.    71.   ]\n",
      " [32.     0.     0.    30.5  ]]\n"
     ]
    }
   ],
   "source": [
    "show_batch(packed_train_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "_EPUS8fPLUb1"
   },
   "outputs": [],
   "source": [
    "example_batch, labels_batch = next(iter(packed_train_data)) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "o2maE8d2ijsq"
   },
   "source": [
    "#### Data Normalization\n",
    "\n",
    "Continuous data should always be normalized."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "WKT1ASWpwH46"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>n_siblings_spouses</th>\n",
       "      <th>parch</th>\n",
       "      <th>fare</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>627.000000</td>\n",
       "      <td>627.000000</td>\n",
       "      <td>627.000000</td>\n",
       "      <td>627.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>29.631308</td>\n",
       "      <td>0.545455</td>\n",
       "      <td>0.379585</td>\n",
       "      <td>34.385399</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>12.511818</td>\n",
       "      <td>1.151090</td>\n",
       "      <td>0.792999</td>\n",
       "      <td>54.597730</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.750000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>23.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>7.895800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>28.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>15.045800</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>35.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>31.387500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>80.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>512.329200</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              age  n_siblings_spouses       parch        fare\n",
       "count  627.000000          627.000000  627.000000  627.000000\n",
       "mean    29.631308            0.545455    0.379585   34.385399\n",
       "std     12.511818            1.151090    0.792999   54.597730\n",
       "min      0.750000            0.000000    0.000000    0.000000\n",
       "25%     23.000000            0.000000    0.000000    7.895800\n",
       "50%     28.000000            0.000000    0.000000   15.045800\n",
       "75%     35.000000            1.000000    0.000000   31.387500\n",
       "max     80.000000            8.000000    5.000000  512.329200"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "desc = pd.read_csv(train_file_path)[NUMERIC_FEATURES].describe()\n",
    "desc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "cHHstcKPsMXM"
   },
   "outputs": [],
   "source": [
    "# TODO 1\n",
    "MEAN = # TODO: Your code goes here.\n",
    "STD = # TODO: Your code goes here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "REKqO_xHPNx0"
   },
   "outputs": [],
   "source": [
    "def normalize_numeric_data(data, mean, std):\n",
    "  # Center the data\n",
    "  # TODO 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[29.631  0.545  0.38  34.385] [12.512  1.151  0.793 54.598]\n"
     ]
    }
   ],
   "source": [
    "print(MEAN, STD)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "VPsoMUgRCpUM"
   },
   "source": [
    "Now create a numeric column. The `tf.feature_columns.numeric_column` API accepts a `normalizer_fn` argument, which will be run on each batch.\n",
    "\n",
    "Bind the `MEAN` and `STD` to the normalizer fn using [`functools.partial`](https://docs.python.org/3/library/functools.html#functools.partial)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Bw0I35xRS57V"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalize_numeric_data at 0x7f52c066f488>, mean=array([29.631,  0.545,  0.38 , 34.385]), std=array([12.512,  1.151,  0.793, 54.598])))"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# See what you just created.\n",
    "normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)\n",
    "\n",
    "numeric_column = tf.feature_column.numeric_column('numeric', normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])\n",
    "numeric_columns = [numeric_column]\n",
    "numeric_column"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "HZxcHXc6LCa7"
   },
   "source": [
    "When you train the model, include this feature column to select and center this block of numeric data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "b61NM76Ot_kb"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: shape=(5, 4), dtype=float32, numpy=\n",
       "array([[ 2.   ,  3.   ,  1.   , 21.075],\n",
       "       [28.   ,  1.   ,  0.   , 16.1  ],\n",
       "       [28.   ,  1.   ,  0.   , 19.967],\n",
       "       [16.   ,  0.   ,  1.   , 39.4  ],\n",
       "       [24.   ,  2.   ,  0.   , 24.15 ]], dtype=float32)>"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example_batch['numeric']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "j-r_4EAJAZoI"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-2.208,  2.132,  0.782, -0.244],\n",
       "       [-0.13 ,  0.395, -0.479, -0.335],\n",
       "       [-0.13 ,  0.395, -0.479, -0.264],\n",
       "       [-1.089, -0.474,  0.782,  0.092],\n",
       "       [-0.45 ,  1.264, -0.479, -0.187]], dtype=float32)"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)\n",
    "numeric_layer(example_batch).numpy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "M37oD2VcCO4R"
   },
   "source": [
    "The mean based normalization used here requires knowing the means of each column ahead of time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tSyrkSQwYHKi"
   },
   "source": [
    "### Categorical data\n",
    "\n",
    "Some of the columns in the CSV data are categorical columns. That is, the content should be one of a limited set of options.\n",
    "\n",
    "Use the `tf.feature_column` API to create a collection with a `tf.feature_column.indicator_column` for each categorical column.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "mWDniduKMw-C"
   },
   "outputs": [],
   "source": [
    "CATEGORIES = {\n",
    "    'sex': ['male', 'female'],\n",
    "    'class' : ['First', 'Second', 'Third'],\n",
    "    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],\n",
    "    'embark_town' : ['Cherbourg', 'Southhampton', 'Queenstown'],\n",
    "    'alone' : ['y', 'n']\n",
    "}\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "kkxLdrsLwHPT"
   },
   "outputs": [],
   "source": [
    "categorical_columns = []\n",
    "for feature, vocab in CATEGORIES.items():\n",
    "  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(\n",
    "        key=feature, vocabulary_list=vocab)\n",
    "  categorical_columns.append(tf.feature_column.indicator_column(cat_col))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "H18CxpHY_Nma"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),\n",
       " IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southhampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),\n",
       " IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),\n",
       " IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),\n",
       " IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# See what you just created.\n",
    "categorical_columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "p7mACuOsArUH"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]\n"
     ]
    }
   ],
   "source": [
    "categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)\n",
    "print(categorical_layer(example_batch).numpy()[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "R7-1QG99_1sN"
   },
   "source": [
    "This will be become part of a data processing input later when you build the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "kPWkC4_1l3IG"
   },
   "source": [
    "### Combined preprocessing layer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "R3QAjo1qD4p9"
   },
   "source": [
    "Add the two feature column collections and pass them to a `tf.keras.layers.DenseFeatures` to create an input layer that will extract and preprocess both input types:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "3-OYK7GnaH0r"
   },
   "outputs": [],
   "source": [
    "# TODO 1\n",
    "preprocessing_layer = # TODO: Your code goes here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "m7_U_K0UMSVS"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.     1.     0.     0.     1.     0.     0.     0.     0.     0.\n",
      "  0.     0.     0.     0.     0.     0.     0.     0.    -2.208  2.132\n",
      "  0.782 -0.244  1.     0.   ]\n"
     ]
    }
   ],
   "source": [
    "print(preprocessing_layer(example_batch).numpy()[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "DlF_omQqtnOP"
   },
   "source": [
    "### Next Step\n",
    "\n",
    "A next step would be to build a build a `tf.keras.Sequential`, starting with the `preprocessing_layer`, which is beyond the scope of this lab.  We will cover the Keras Sequential API in the next Lesson."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load NumPy data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load necessary libraries \n",
    "First, restart the Kernel.  Then, we will start by importing the necessary libraries for this lab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TensorFlow version:  2.6.5\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "\n",
    "print(\"TensorFlow version: \",tf.version.VERSION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load data from `.npz` file\n",
    "\n",
    "We use the MNIST dataset in Keras."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "DATA_URL = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'\n",
    "\n",
    "path = tf.keras.utils.get_file('mnist.npz', DATA_URL)\n",
    "with np.load(path) as data:\n",
    "# TODO 1\n",
    "  train_examples = # TODO: Your code goes here.\n",
    "  train_labels = # TODO: Your code goes here.\n",
    "  test_examples = # TODO: Your code goes here.\n",
    "  test_labels = # TODO: Your code goes here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load NumPy arrays with `tf.data.Dataset`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assuming you have an array of examples and a corresponding array of labels, pass the two arrays as a tuple into `tf.data.Dataset.from_tensor_slices` to create a `tf.data.Dataset`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO 2\n",
    "train_dataset = # TODO: Your code goes here.\n",
    "test_dataset = # TODO: Your code goes here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Next Step\n",
    "\n",
    "A next step would be to build a build a `tf.keras.Sequential`, starting with the `preprocessing_layer`, which is beyond the scope of this lab.  We will cover the Keras Sequential API in the next Lesson."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Resources \n",
    "1. Load text data - this link: https://www.tensorflow.org/tutorials/load_data/text\n",
    "2. TF.text - this link:  https://www.tensorflow.org/tutorials/tensorflow_text/intro\n",
    "3. Load image daeta - https://www.tensorflow.org/tutorials/load_data/images\n",
    "4. Read data into a Pandas DataFrame - https://www.tensorflow.org/tutorials/load_data/pandas_dataframe\n",
    "5. How to represent Unicode strings in TensorFlow - https://www.tensorflow.org/tutorials/load_data/unicode\n",
    "6. TFRecord and tf.Example -  https://www.tensorflow.org/tutorials/load_data/tfrecord                  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2022 Google Inc.\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
    "http://www.apache.org/licenses/LICENSE-2.0\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "csv.ipynb",
   "private_outputs": true,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
